Abstract

Various factors are believed to govern the selection of references in citation networks, but a
precise, quantitative determination of their importance has remained elusive. In this paper, we
show that three factors can account for the referencing pattern of citation networks for two topics,
namely "graphene" and "complex networks", thus allowing one to reproduce the topological
features of the networks built with papers being the nodes and the edges established by citations.
The most relevant factor was content similarity, while the other two, in-degree (i.e. citation
counts) and age of publication, had varying importance depending on the topic studied. This
dependence indicates that additional factors could play a role. Indeed, by intuition one should
expect the reputation (or visibility) of authors and/or institutions to affect the referencing pattern,
and this is only indirectly considered via the in-degree that should correlate with such reputation.
Because information on reputation is not readily available, we simulated its effect on artificial
citation networks considering two communities with distinct fitness (visibility) parameters. One
community was assumed to have twice the fitness value of the other, which amounts to doubling
the probability of a paper being cited. While the h-index for authors in the community with larger
fitness evolved with time with slightly higher values than for the control network (no fitness
considered), the h-index of authors in the community with smaller fitness was drastically reduced.

1. Introduction

Quantitative evaluations of researchers, institutions, geographical regions, journals and areas
of science and technology have become commonplace, especially with the widespread
availability of information in scientific databases. Citation counts and impact factors are among
the most common parameters used and may be key for deciding on promotions, grants and
identification of scientific trends. Science has become to a certain extent driven by
scientometry (Ball, 2005; Bornmann, Schier, Marx, & Daniel, 2012; Garfield, 1972), which is
motivation for detailed studies of the way scientometric parameters are defined and of patterns
of citations (Roth, Wu, & Lozano, 2012). Citation networks, for instance, have been modeled with
concepts and methodologies of complex networks (Albert & Barabási, 2002; Newman, 2010;
Boccaletti, Latora, Moreno, Chavez, & Hwang, 2006; Costa, Rodrigues, Travieso, & Villas Boas,
2007; Newman, 2003). The degree of these networks (i.e. the number of citations received by
papers) was found to follow the scale-free behavior (Barabási & Bonabeau, 2003; Price, 1965),
which amounts to saying that the probability of a paper being cited depends on its citation
counts (Newman, 2010; Price, 1965). Also known is that content similarity plays a role
in the choice of references (Menczer, 2004), even though the correlation with the most related
papers has been found to be low (Amancio, Nunes, Oliveira Jr., & Costa, 2012). Other factors
considered to affect the citation pattern are the age of publication, since recent papers are more
likely to be cited than old ones (Geng & Wang, 2009; Kamalika, 2005), the reputation of
authors, journals and institutions, and even the authors' language as it affects the readability of
papers (Bornmann, Schier, Marx, & Daniel, 2012).

With the variety of possible factors, modeling citation networks has not been straightforward.
Traditional models considering one feature at a time may be successful in explaining the
dynamics of this feature, but may on the other hand miss important points regarding overlooked
features (Menczer, 2004). The preferential attachment model (Albert & Barabási, 2002), for
instance, predicts the degree distribution of the networks, but fails to match the actual content
similarity of real databases (Menczer, 2004). Other methods also explain the degree
distributions (Menczer, 2004) or clustering coefficient (Wu & Holme, 2009), but not the content
similarity and distribution of the time difference between papers and their references. According
to Ref. (Menczer, 2004) these features follow well-known distributions. The content
similarity obeys a Gaussian-like distribution, while the age dependence distribution follows a power
law (Newman, 2005). Therefore, in the attempts to model citation networks one should consider
as many features as possible. In this paper, we propose a model that takes into account three
factors believed to affect the pattern of citations, namely the in-degree distribution, the content
similarity and the age of publication. We shall show that this model is capable of reproducing
topological characteristics of citation networks obtained for two topics in the arXiv¹ repository.
Because it is hard to quantify the reputation or visibility of journals or institutions, this factor
could not be included in the model. Alternatively, we designed artificial networks with two
communities of authors differing in their visibility (fitness), i.e. with different probabilities of having
their papers cited. We shall show that differences in fitness cause major effects on the
temporal evolution of the h-index (Ball, 2005; Costas & Bordons, 2007; Hirsch, 2005) of authors.



2. Modeling Citation Networks 

We propose a model to describe features of citation networks in which three parameters
are assumed to govern the network, namely topology, content similarity (semantics) and age of
publication. Simulated networks were then created with the citations being selected according
to one of these criteria, and then with a combination of the three criteria. The content similarity
was computed by collecting papers from the arXiv repository for two topics, viz. "complex
networks" and "graphene", yielding the networks referred to as CN and GF, respectively. For the
sake of processing times, only the abstracts were considered, and each paper was characterized
by the frequency of lemmatized² words, disregarding stopwords³. Assuming that the frequencies
of words in papers $a$ and $b$ are given by the vectors $\vec{v}_a$ and $\vec{v}_b$, where the element $v_a(i)$ represents
the frequency of word $i$, then the content similarity $\sigma_{ab}$ between the two papers is:

$$\sigma_{ab} = \frac{\vec{v}_a \cdot \vec{v}_b}{\|\vec{v}_a\| \, \|\vec{v}_b\|} \qquad (1)$$

¹ http://www.arxiv.org

² Lemmatization consists in converting words to their canonical form. In this step, verbs are converted into their
infinitive form and nouns are converted into their singular form.

³ Stopwords are highly frequent words conveying little semantic meaning, such as articles and prepositions.



Because $\sigma_{ab}$ gives the cosine of the angle between the vectors, $\sigma_{ab}$ lies between 0 and 1. As
reported in Ref. (Menczer, 2004) and verified in both real networks extracted from arXiv, the
distribution of $\sigma_{ab}$ for every $a$ citing $b$ follows a normal distribution:

$$P(\sigma_{ab}) = \frac{1}{\sqrt{2\pi s^2}} \exp\left[-\frac{(\sigma_{ab} - \mu)^2}{2 s^2}\right] \qquad (2)$$

where $\mu$ and $s^2$ are the mean and variance, respectively.
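As an illustration of Eq. (1), the following minimal Python sketch computes the cosine similarity between two word-frequency vectors. It is not the authors' implementation: the stopword set and the function names `word_frequencies` and `content_similarity` are hypothetical, and lemmatization is omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical stopword list for illustration; the paper does not list the one it used.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "on", "for", "to"}


def word_frequencies(text):
    """Count the words of a text, ignoring stopwords (lemmatization is omitted here)."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def content_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors, as in Eq. (1)."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty abstract is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


a = word_frequencies("the structure of complex networks")
b = word_frequencies("complex networks and the structure of graphene")
sim = content_similarity(a, b)
```

Since word frequencies are non-negative, the value returned always falls between 0 and 1, in agreement with the range stated for $\sigma_{ab}$.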

The other criterion to select the citations is a preferential attachment rule based on the current
in-degree of a paper. Thus, papers with high citation counts are more likely to be cited again,
according to a power law $p(k) \propto k^{\gamma_k}$, where $k$ is the in-degree and $\gamma_k$ is the coefficient of the
power law $p(k)$, computed according to the methodology devised in Ref. (Bauke, 2007). As for
the criterion of age of publication, the citation count is taken as inversely proportional to the time
difference $\Delta t$ between the article and its references. As observed in Ref. (Kamalika, 2005), and
confirmed in our two real citation networks, the power law function $p(\Delta t) \propto \Delta t^{\gamma_t}$ can be used to
characterize the likelihood of an article being cited $\Delta t$ months after its publication date.
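The two power-law criteria can be turned into selection probabilities by normalizing the weights over the candidate papers. The sketch below is only illustrative: the function name `power_law_weights` is hypothetical, the exponents are placeholders rather than the fitted coefficients (which are not quoted in the text), and strictly positive inputs are assumed because the treatment of zero in-degree or zero age is not specified.

```python
def power_law_weights(values, exponent):
    """Normalized weights proportional to value**exponent.

    Assumes every value is strictly positive; zero in-degree or zero age
    would need separate handling, which the text does not specify.
    """
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    raw = [v ** exponent for v in values]
    total = sum(raw)
    return [r / total for r in raw]


# Illustrative exponents only; they are not the coefficients fitted in the paper.
in_degree_weights = power_law_weights([1, 2, 4], exponent=1.0)
age_weights = power_law_weights([1, 2, 4], exponent=-1.0)  # ages in months
```

With a positive exponent the most cited paper receives the largest weight, whereas a negative exponent favors the most recent paper.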

The simulated networks obtained with only one of the criteria exhibited topological
properties that differed considerably from the real networks extracted from the arXiv repository for both
subjects "complex networks" (CN) and "graphene" (GF) (results not shown). This finding is
quantified by the error (see definition in Appendix A) reported in Tables 1 and 2
for the attempts to fit the networks. Excellent agreement was observed, however, when the three
criteria were combined in an optimization procedure, as shown in Figures 1 and 2 for the CN and
GF networks. The contribution from each of the criteria ($\alpha$ for topology, $\beta$ for content similarity
and $\lambda$ for time difference) was computed upon minimizing the error $\epsilon$, as described in Appendix A.



Table 1: Best model found with the simulated annealing heuristic (see Appendix A). The combination of the three criteria
gives optimized results, because in the best cases $\alpha$, $\beta$ and $\lambda \neq 0$. In other words, the model yielding the minimum error
$\epsilon_{\min}$ employs all three features.

Network            $\alpha$     $\beta$      $\lambda$    $\epsilon_{\min}$
Complex Network    40.0 %   52.5 %   7.5 %    0.056
Graphene           5.0 %    45.0 %   50.0 %   0.128



The results in Table 1 indicate that for both networks the similarity of content is an important
criterion for selecting references, being responsible for approximately 50 % of the citations.
The preferential attachment (represented by taking the in-degree into account) was relevant for
the CN network, while the age of publication was more relevant for the GF network. Even
though the content similarity is the most important factor, this does not mean that authors are
selecting for the list of references the most similar papers to the manuscript being produced.
This can be observed in the distributions of Figures 1(c) and 2(c), which show that only
a few cited articles are very similar. It is also consistent with the low correlation found
between the actual lists of references and the most similar papers in a database in another piece of
work (Amancio, Nunes, Oliveira Jr., & Costa, 2012).

Table 2: Models obtained with only one factor at a time and comparison with the minimum error $\epsilon_{\min}$ obtained with
the models depicted in Table 1. Because $\epsilon/\epsilon_{\min} > 1$, the model combining the 3 factors is more accurate than those
considering only one feature.

Network            $\alpha$      $\beta$       $\lambda$     $\epsilon/\epsilon_{\min}$
Complex Network    100.0 %   0.0 %     0.0 %     3.125
Complex Network    0.0 %     100.0 %   0.0 %     2.104
Complex Network    0.0 %     0.0 %     100.0 %   7.982
Graphene           100.0 %   0.0 %     0.0 %     2.617
Graphene           0.0 %     100.0 %   0.0 %     1.945
Graphene           0.0 %     0.0 %     100.0 %   3.445



It has to be admitted, nevertheless, that the need to employ distinct parameters for
reproducing the real networks indicates that the three-criterion model is not universal. It cannot
account for all features of citation networks. This limitation was indeed expected since
intuitively one knows that other criteria are important for selecting references. Perhaps the most
relevant is the reputation (or visibility) of authors and journals (De Groote, Shultz, & Doranski,
2005; Stremersch, Verniers, & Verhoef, 2007), which is partially (but not entirely) implicit in the
in-degree incorporated in our model. We did not include the visibility criterion in the model
because this type of information is not readily available. For example, not all papers in the arXiv
database have been published in journals, so it is impossible to use the impact factor of the
corresponding journals. Regarding the institutions, there is no well-established index quantifying their notoriety
or reputation. As for the authors, use could be made of the ISI.highlycited.com database⁴, but
only a small number of authors are listed.

We therefore examine the possible effects of visibility on citation networks in the next section.



3. Effects from Visibility on the Evolution of h-index 



In order to analyze how visibility interferes with the dynamics of the citation network, we study
the evolution of the h-index (Ball, 2005; Costas & Bordons, 2007; Hirsch, 2005) of authors
belonging to two artificial communities with distinct visibility, assuming that one community
is twice as visible as the other. Four models were considered, differing in terms of in-degree
distribution and fitness. In all models, we assume that the number of articles published by an
author each year follows a power law $p(y) = c\,y^{-\gamma}$, where $p(y)$ represents the probability distribution
of $y$, and $\gamma$ and $c$ are real parameters. These values were determined by defining the endpoints
$(y, p(y))$ of the distribution: $(1, m)$ and $(s, 1)$. Consequently, we assume that $m$ authors publish one
paper every year and only one author publishes $s$ papers per year. With these limits, $c = m$ and

$$\gamma = \frac{\log(m)}{\log(s)} \qquad (3)$$



⁴ http://researchanalytics.thomsonreuters.com/highlycited

Figure 1: Distributions for the CN (real) network (red) and the proposed model (blue). The
properties predicted by the model include topological (in-degree in graphic (a) and average shortest path
length (Costa, Rodrigues, Travieso, & Villas Boas, 2007) in (b)), semantic (content similarity between a paper and its
references in (c)) and temporal (in (d)) features.

In our experiments we assume $m = 15$ and $s = 30$. Note that from $p(y) = c\,y^{-\gamma}$, it is necessary
to sample some values of $y$. Without loss of generality, we chose the following values: $y =
(1,2,3,5,10,15,30)$ and obtained $p(y) = (15,9,6,4,2,2,1)$. In other words, 15 authors are assumed
to publish one article every year, 9 authors publish two articles per year and so on. Therefore,
there are $N_a = \sum_y p(y) = 39$ authors and a total of $N_p = \sum_y y\,p(y) = 151$ papers were published per
year. This distribution was assumed for each one of the communities (hereafter referred to as
communities A and B) and thus a total of 302 papers were published per year by 78 authors.
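The author counts quoted for $m = 15$ and $s = 30$ can be checked numerically. The short sketch below evaluates Eq. (3) and the power law at the sampled values of $y$; rounding to the nearest integer is an assumption made here, since the text does not state how the counts were converted to integers.

```python
import math

m, s = 15, 30                      # endpoints (1, m) and (s, 1) from the text
gamma = math.log(m) / math.log(s)  # Eq. (3)
c = m                              # follows from p(1) = m

sampled_y = [1, 2, 3, 5, 10, 15, 30]
# Rounding to the nearest integer is an assumption; the text does not state it.
authors = [round(c * y ** -gamma) for y in sampled_y]

n_authors = sum(authors)                                    # authors per community
n_papers = sum(y * p for y, p in zip(sampled_y, authors))   # papers per community per year
```

Under this rounding assumption the sketch gives the counts (15, 9, 6, 4, 2, 2, 1), that is, 39 authors and 151 papers per community per year, matching the values in the text.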

The citation network was represented with a digraph $\Gamma = (V, E)$ where the vertices $V$ are
papers and edges $E$ are established with citations between papers. Because the network
grows by incorporation of new papers over a period of 25 years, both $V$ and $E$ increase
with time. In order to distinguish communities A and B with regard to their visibility (i.e.,
the likelihood to receive new citations), we arbitrarily assumed the visibility (fitness) $f_A$ of
community A as being twice the fitness $f_B$ of community B. That is to say, articles in community
A are twice as likely to be cited. The different values of visibility were adopted to simulate
differences arising due to distinct impact factors of journals or authors' institutions, among
others (Bornmann, Schier, Marx, & Daniel, 2012). The growth of the citation networks was
obtained for each year.

Figure 2: Distributions for the GF (real) network (red) and the proposed model (blue). The properties predicted by the
model are the same as in Figure 1.

The four models used are: (i) UNI: uniform, random selection of references; (ii) PREF:
preferential selection of references depending on the fitness of the community; (iii) PREFC:
preferential selection for papers with larger in-degree (i.e. highly cited papers); and (iv) DBPREF:
preferential selection depending on the fitness and in-degree. In the UNI model, each article included is
assumed to cite $w$ randomly selected published papers. Analogously, in the PREF model random
papers are cited, but considering the community visibility. The PREFC model is also preferential,
but here each of the $w$ citations of each article is chosen preferentially for papers with higher
citation counts, counted from the first to the current year. Therefore, this is similar to the
Barabási-Albert model (see Refs. (Albert & Barabási, 2002; Boccaletti, Latora, Moreno, Chavez, & Hwang,
2006; Costa, Rodrigues, Travieso, & Villas Boas, 2007; Newman, 2003)), except for the fact that
the in-degree (citation count) is not increased just after the addition of a new paper, but only at
the end of a year. The DBPREF model is preferential both in terms of visibility and in-degree.
More specifically, a list is kept where the identification of each article is entered a total number
of times corresponding to the value of its citation count multiplied by the community visibility
(we assume $f_A = 2$ and $f_B = 1$ in order to establish the proportion $f_A / f_B = 2$). New citations are
then chosen by random, uniform selection among the elements in the above list.
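The four selection rules can be summarized as different weightings of the candidate papers. The Python sketch below is a simplified illustration, not the authors' code: the names `selection_weights` and `choose_references` are hypothetical, references are drawn with replacement, and 1 is added to every citation count so that uncited papers remain selectable, a detail the text does not specify.

```python
import random

FITNESS = {"A": 2, "B": 1}  # f_A / f_B = 2, as in the text


def selection_weights(papers, model):
    """Relative weight of each candidate paper under the four models.

    `papers` is a list of (community, citation_count) pairs. Adding 1 to the
    citation count is an assumption made here so that uncited papers can still
    be selected; the text does not say how they are handled.
    """
    weights = []
    for community, citations in papers:
        if model == "UNI":
            weight = 1
        elif model == "PREF":
            weight = FITNESS[community]
        elif model == "PREFC":
            weight = citations + 1
        elif model == "DBPREF":
            weight = (citations + 1) * FITNESS[community]
        else:
            raise ValueError(f"unknown model: {model}")
        weights.append(weight)
    return weights


def choose_references(papers, model, w, rng=random):
    """Draw w reference indices (with replacement, a simplification)."""
    weights = selection_weights(papers, model)
    return rng.choices(range(len(papers)), weights=weights, k=w)


papers = [("A", 3), ("B", 3), ("A", 0)]
refs = choose_references(papers, "DBPREF", w=5, rng=random.Random(0))
```

To mimic the yearly update described for PREFC and DBPREF, the caller would pass citation counts frozen at the end of the previous year and refresh them only after all papers of the current year have been added.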

Each configuration was simulated 20 times to provide statistical representativeness,
while the h-index and total citation counts were computed for each author each year. The results
in Figure 3 indicate that including a preferential attachment (PREF) based on the fitness of a
community has little effect for Community A, whose h-index increases marginally, but a large
effect for Community B. Indeed, the h-index of all authors in Community B increased at a lower
rate and after 25 years was considerably lower than that for authors in Community A. In fact,
the h-index values are much smaller than for the model with random selection (UNI), as will
be explained later on. These observations apply for $w = 5$ or 20, though obviously the overall
h-index values are higher for the networks built with $w = 20$.
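For reference, the h-index of an author (Hirsch, 2005) is the largest number $h$ such that $h$ of the author's papers have at least $h$ citations each. A minimal sketch of this computation, with the hypothetical function name `h_index`, is:

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(ranked, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h


example = h_index([10, 8, 5, 4, 3])
```

In a simulation such as the one described here, this function would be applied each year to the citation counts of every author's papers.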

With regard to the importance of citation counts, Figure 4 shows a small increase in h-index
for $w = 5$ in comparison with the UNI model (Figure 3). In contrast, the h-index values are much lower
when applying the preferential attachment rule for $w = 20$ in Figure 4 than for the random
case (UNI model in Figure 3). When in addition to considering the in-degree (PREFC) we also
consider the fitness (DBPREF), there is a marginal increase in h-index for community A, but the
effects are again strong for community B. For the authors of the latter community, the h-index
values achieved are much lower. The only exception appears to be for the author with the largest
number of papers and $w = 20$. For some reason, there is a compensation effect in this case, and
the h-index of this author is not so much lower. Note also that upon applying the preferential
attachment rule based on the in-degree (for PREFC and DBPREF), the asymmetric distribution
of citations among papers caused the h-index to be considerably lower than with the UNI or
PREF models for a fixed number of references.




Figure 3: Dynamics of the h-index using (a) UNI model with $w = 5$; (b) PREF model for community A with $w = 5$; (c)
PREF model for community B with $w = 5$; (d) UNI model with $w = 20$; (e) PREF model for community A with $w = 20$;
and (f) PREF model for community B with $w = 20$.



Figure 4: Dynamics of the h-index using (a) PREFC model with $w = 5$; (b) DBPREF model for community A with
$w = 5$; (c) DBPREF model for community B with $w = 5$; (d) PREFC model with $w = 20$; (e) DBPREF model for
community A with $w = 20$; and (f) DBPREF model for community B with $w = 20$.



4. Conclusion



The combination of the three factors, namely content similarity, in-degree and date of
publication, has been effective in generating a model that reproduced several topological features
of citation networks. The model represents, therefore, considerable progress compared to the
literature in explaining the dynamics of citation networks. This applied to two real networks
obtained from the arXiv repository for the topics "graphene" and "complex networks", but the
relative importance of the three factors varied for the networks. While the content similarity
was the most relevant factor for both networks, the other two had distinct levels of importance
depending on the network. This network dependence probably reflects the fact that
other factors are also relevant for the pattern of citations, as even highly similar articles can be
forgotten (Amancio, Nunes, Oliveira Jr., & Costa, 2012). In fact, the reputation of authors and
institutions is expected to play an important role, but its quantification is not possible with the
databases currently available. One may argue that the effect from reputation is at least partially
taken into account when the in-degree is considered, for papers with larger citation count are
more likely to receive additional citations (Newman, 2010; Price, 1965). But this is only an
indirect manifestation of the reputation, which does not cover the higher visibility that papers from
renowned authors and institutions have right after being published (when the citation count is
still small or zero).

Owing to the importance of the visibility (or reputation) factor, we decided to verify its
effects on the evolution of the h-index of authors by considering artificial citation networks. For
the latter we showed that authors in the community with higher fitness (i.e. higher probability of having
their papers cited) benefit only marginally, in terms of their h-index, in comparison
with a control citation network with no bias. This increase in the h-index of prominent authors
probably occurs because the h-index is Lotkaian (follows a power law distribution) and therefore
the concentration effect might be a reinforcement effect in three-dimensional informetrics (Egghe,
2005). In contrast, communities with less visibility can be hit hard, as their h-index values
could be considerably lower than those estimated for the control, unbiased network. This finding
confirms the observation in real networks that h-values depend on the productivity and citation
practices of given fields (Alonso, Cabrerizo, Herrera-Viedma, & Herrera, 2009). Therefore, caution should
be taken when using the h-index to assess authors from distinct communities.



Appendix A - Setting Up the Parameters of the Model 

Given the 3 probability distributions concerning topological, semantic and temporal features,
the model selects by chance one of these distributions to choose a paper to be included in its
reference list. The prominence of each distribution is set according to the value of 2 thresholds: $t_1 = \alpha$
and $t_2 = \alpha + \beta$. In other words, if the random number $n_r$, such that $0 < n_r < 1$, is less than $t_1$,
then the $p(k)$ distribution is chosen. On the other hand, if $t_1 < n_r < t_2$, then the content similarity
$p(\sigma)$ is chosen. Otherwise, if $t_2 < n_r < 1$, then $p(\Delta t)$ is selected. Thus, the prominence of the
topological, semantic and temporal factors is given respectively by $\alpha$, $\beta$ and $\lambda = 1 - \alpha - \beta$. In
order to optimize the model, we minimized the following error $\epsilon$:



$$\epsilon = [\,\text{expression not recoverable from the source}\,] \qquad (4)$$

whose parameters are explained in Table 3.



Table 3: List of variables in the model.

Variable          Meaning
$\gamma_{k,m}$    Power law coefficient for the in-degree of the network obtained from the model.
$\gamma_{k,r}$    Power law coefficient for the in-degree in the real network.
[...]             Average shortest path length of the network obtained from the model.
[...]             Average shortest path length of the real network.
$s_{l,m}$         Standard deviation of the shortest path length for the network obtained from the model.
$s_{l,r}$         Standard deviation of the shortest path length for the real network.
[...]             Average content similarity between an article and its references for the model.
[...]             Average content similarity between an article and its references for the real network.
[...]             Standard deviation of the content similarity between an article and its references for the model.
[...]             Standard deviation of the content similarity between an article and its references for the real network.
$\gamma_{t,m}$    Power law coefficient of $\Delta t$ for the network obtained from the model.
$\gamma_{t,r}$    Power law coefficient of $\Delta t$ for the real network.



The weights were distributed in order to give equal weighting to the three factors (1/6 +
1/12 + 1/12 = 1/3 for topology, 1/6 + 1/6 = 1/3 for semantics and 1/3 for the temporal feature).
Because a brute-force search is impracticable, we made use of the simulated annealing
heuristic (Press, Teukolsky, Vetterling, & Flannery, 2007) in the simulations to minimize the error $\epsilon$.
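The threshold rule described at the beginning of this appendix, which decides the distribution used to pick each reference, can be sketched as follows. This is an illustrative reading of the rule, not the authors' code; the function name `choose_criterion` and the returned labels are hypothetical.

```python
import random


def choose_criterion(alpha, beta, rng=random):
    """Pick which distribution supplies the next reference.

    Thresholds follow the appendix: t1 = alpha and t2 = alpha + beta; the
    remaining share 1 - alpha - beta goes to the temporal criterion.
    """
    if alpha < 0 or beta < 0 or alpha + beta > 1:
        raise ValueError("alpha and beta must be non-negative with alpha + beta <= 1")
    t1, t2 = alpha, alpha + beta
    n_r = rng.random()       # uniform random number in [0, 1)
    if n_r < t1:
        return "topology"    # use p(k)
    if n_r < t2:
        return "semantics"   # use p(sigma)
    return "temporal"        # use p(delta t)


# Weights reported in Table 1 for the CN network: alpha = 40.0 %, beta = 52.5 %.
rng = random.Random(42)
draws = [choose_criterion(0.40, 0.525, rng) for _ in range(10000)]
share_topology = draws.count("topology") / len(draws)
```

Over many draws the fraction of references selected through each criterion approaches $\alpha$, $\beta$ and $1 - \alpha - \beta$, which is how these parameters express the prominence of each factor.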